
Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data



Abstract

We consider the problem of model-based clustering in the presence of many correlated, mixed continuous and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder (ASD) on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (~480) in the data set along with many (~100) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (i) cluster these patients into similar groups to help identify those with similar clinical presentation, and (ii) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably to other methods via simulation of problems of this type. The results of the ASD analysis suggested three clusters to be most likely, while only four test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values, is a common problem in the health sciences as well as in many other disciplines.
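The two modeling ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a truncated stick-breaking construction for the Dirichlet process weights and a simple threshold rule for mapping a latent continuous value to an ordinal category. The function names and the parameters `alpha`, `truncation`, and `cutpoints` are hypothetical and chosen only for illustration.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha and truncation are illustrative assumptions, not values from the paper.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, alpha)  # Beta(1, alpha) break point
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # last component takes whatever is left
    return weights


def discretize(latent, cutpoints):
    """Map a latent continuous value to an ordinal category.

    The category is the number of (hypothetical) cutpoints the value exceeds.
    """
    return sum(latent > c for c in cutpoints)


rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, truncation=20, rng=rng)
categories = [discretize(x, [0.0, 1.0]) for x in (-1.0, 0.5, 2.0)]
```

With a small `alpha`, most of the weight tends to fall on the first few components, which is how a Dirichlet process mixture can leave the effective number of clusters to be inferred rather than fixed in advance.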
